Skip to content

[Refactor] Seperate colocate and disaggregated produce flow#1884

Merged
jayhenry merged 2 commits into
InternLM:mainfrom
jayhenry:sep_code
Jun 11, 2026
Merged

[Refactor] Seperate colocate and disaggregated produce flow#1884
jayhenry merged 2 commits into
InternLM:mainfrom
jayhenry:sep_code

Conversation

@jayhenry

@jayhenry jayhenry commented Jun 7, 2026

Copy link
Copy Markdown
Collaborator

No description provided.

@YanhuiDua

Copy link
Copy Markdown
Collaborator

@claude review

@jayhenry

Copy link
Copy Markdown
Collaborator Author

@claude review

Comment thread xtuner/v1/rl/agent_loop_manager/produce_utils.py
Comment thread xtuner/v1/rl/agent_loop_manager/disagg_producer.py Outdated
Comment thread xtuner/v1/rl/agent_loop_manager/agent_loop_manager.py
Comment thread xtuner/v1/rl/agent_loop_manager/disagg_agent_loop_manager.py
Comment thread xtuner/v1/rl/agent_loop_manager/produce_utils.py Outdated
@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Claude: ## Summary

Clean structural refactor that separates the colocate and disaggregated produce flows into distinct classes (AgentLoopManager + ProduceStrategy vs DisaggAgentLoopManager + DisaggProduceStrategy), with shared utilities extracted to produce_utils.py. The decomposition is well-motivated and the code extraction is faithful to the original semantics.

Issues

Warning

  • [agent_loop_manager.py:258] Missing try/finally around asyncio.gather(*produce_futures) — if one strategy raises, rollout controller stays in generating state and sibling strategies' pending tasks are orphaned. The old code had try/finally for cleanup.
  • [produce_utils.py:773] _put_claimed_tasks calls task.result() without exception handling. A failed generation task will crash the produce loop rather than being handled as FAILED.
  • [disagg_producer.py:314] Early return when total_target <= 0 skips claiming pending tasks from prior rounds, leaving completed tasks (potentially holding routed-experts refs) alive until the next non-zero round.

Nit

  • [disagg_agent_loop_manager.py:342] Edge case: when train_step == 1, current_model_step == self._model_step == 0 so the EXPIRED_BATCH condition falls through to RuntimeError. Likely unreachable but fragile.
  • [produce_utils.py:392] stale_threshold or 1 would silently override a legitimate 0; use is not None check instead.

ProduceBatchResult impact: not affected

Field semantics, accounting, and population logic are preserved through the refactor. The ProduceBatchResult dataclass moves to produce_utils.py with identical field definitions.

RoutedExperts impact: not affected

Routed-experts ownership flows through unchanged put_generated_group / reset_rollout_response paths. One minor temporary-retention concern noted in the disagg_producer early-return finding above, but no permanent leaks introduced.

Verdict

APPROVE — the refactoring is sound and well-tested. The findings above are all low-to-moderate severity (resource cleanup on error paths, edge cases). None affect correctness in the normal produce flow.

@jayhenry jayhenry merged commit c2231af into InternLM:main Jun 11, 2026
5 of 6 checks passed
braisedpork1964 pushed a commit to braisedpork1964/xtuner that referenced this pull request Jun 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants